Machine learning for Asian language text classification
Internal identifier: 000495 (Main/Exploration); previous: 000494; next: 000496
Authors: Fuchun Peng [United States]; Xiangji Huang [Canada]
Source:
- Journal of Documentation [ 0022-0418 ] ; 2007-05-01.
English descriptors
- Teeft :
- Algorithm, Asian language, Asian language text, Bayes, Bayes model, Bayes result, Best compression, Best result, Binary problems, Byte level models, Categorization, Certain level, Character level, Character level features, Chinese data, Chinese data table, Chinese experiments, Chinese text, Chinese word segmentation, Class label, Computational linguistics, Entropy, Exponential form, Feature engineering, Feature selection, Feature space, Higher order models, Ieee transactions, Individual scores, Information retrieval, International conference, Japanese data, Japanese text, Jdoc, Language model, Language modeling, Language modeling approach, Language models, Large number, Markov independence assumption, Maximum entropy, Maximum entropy distribution, Modeling, Multinomial model, Mutual information, Natural language, Natural language processing, Negative examples, Overall accuracy, Peng, Perplexity, Retrieval, Segmentation, Segmentation accuracy, Segmentation performance, Sparse data problems, Standard approaches, Standard techniques, Standard text, Statistical language modeling, Support vector machine, Support vector machines, Support vectors, Table viii, Test corpus, Test document, Text categorization, Text categorization problems, Text retrieval, Training data, Training examples, Uncommon features, Word counts, Word level, Word level features, Word segmentation, Word segmentation accuracies, Word segmentation accuracy, Word segmentation information, Word sequences.
Abstract
Purpose: The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach: Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings: There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications: Applying the findings to real web text classification is ongoing work. Originality/value: The paper is very relevant to Chinese and Japanese information processing, e.g. web page classification, web search.
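The language modeling approach the abstract advocates can be illustrated with a minimal sketch: train one character-level n-gram language model per class and assign a document to the class whose model gives it the highest log-likelihood (equivalently, the lowest cross-entropy). This is not the authors' implementation — the class names, the bigram order, and the add-one smoothing are illustrative assumptions; the paper's models and smoothing may differ.

```python
from collections import defaultdict
import math

class CharNgramClassifier:
    """Character-level n-gram language model classifier (illustrative sketch):
    one LM per class; a document is assigned to the class whose LM gives it
    the highest log-likelihood."""

    def __init__(self, n=2):
        self.n = n
        self.ngrams = {}    # class -> context -> char -> count
        self.contexts = {}  # class -> context -> total count
        self.vocab = set()  # character vocabulary shared across classes

    def train(self, docs_by_class):
        for label, docs in docs_by_class.items():
            grams = defaultdict(lambda: defaultdict(int))
            totals = defaultdict(int)
            for doc in docs:
                padded = "^" * (self.n - 1) + doc  # "^" marks document start
                for i in range(self.n - 1, len(padded)):
                    ctx, ch = padded[i - self.n + 1:i], padded[i]
                    grams[ctx][ch] += 1
                    totals[ctx] += 1
                    self.vocab.add(ch)
            self.ngrams[label] = grams
            self.contexts[label] = totals

    def log_prob(self, label, doc):
        # Add-one (Laplace) smoothing over the shared character vocabulary,
        # so unseen n-grams get a small non-zero probability.
        v = len(self.vocab) + 1
        padded = "^" * (self.n - 1) + doc
        lp = 0.0
        for i in range(self.n - 1, len(padded)):
            ctx, ch = padded[i - self.n + 1:i], padded[i]
            count = self.ngrams[label][ctx][ch]   # 0 if unseen
            total = self.contexts[label][ctx]     # 0 if context unseen
            lp += math.log((count + 1) / (total + v))
        return lp

    def classify(self, doc):
        return max(self.ngrams, key=lambda label: self.log_prob(label, doc))
```

Because the model works directly on characters, no word segmentation step is required — which is the practical appeal of the non-segmentation-based approach for Chinese and Japanese, where word boundaries are not marked in text.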
DOI: 10.1108/00220410710743306
Links toward previous steps (curation, corpus...)
- to stream Istex, to step Corpus: 001D10
- to stream Istex, to step Curation: 001701
- to stream Istex, to step Checkpoint: 000447
- to stream Main, to step Merge: 000498
- to stream Main, to step Curation: 000495
The document in XML format
<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Machine learning for Asian language text classification</title>
<author><name sortKey="Peng, Fuchun" sort="Peng, Fuchun" uniqKey="Peng F" first="Fuchun" last="Peng">Fuchun Peng</name>
</author>
<author><name sortKey="Huang, Xiangji" sort="Huang, Xiangji" uniqKey="Huang X" first="Xiangji" last="Huang">Xiangji Huang</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:6EF70F3DF40FE6C3EF63BBF334B5EDF3E88053F9</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1108/00220410710743306</idno>
<idno type="url">https://api.istex.fr/ark:/67375/4W2-6TSJ99BC-N/fulltext.pdf</idno>
<idno type="wicri:Area/Istex/Corpus">001D10</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Corpus" wicri:corpus="ISTEX">001D10</idno>
<idno type="wicri:Area/Istex/Curation">001701</idno>
<idno type="wicri:Area/Istex/Checkpoint">000447</idno>
<idno type="wicri:explorRef" wicri:stream="Istex" wicri:step="Checkpoint">000447</idno>
<idno type="wicri:doubleKey">0022-0418:2007:Peng F:machine:learning:for</idno>
<idno type="wicri:Area/Main/Merge">000498</idno>
<idno type="wicri:Area/Main/Curation">000495</idno>
<idno type="wicri:Area/Main/Exploration">000495</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Machine learning for Asian language text classification</title>
<author><name sortKey="Peng, Fuchun" sort="Peng, Fuchun" uniqKey="Peng F" first="Fuchun" last="Peng">Fuchun Peng</name>
<affiliation wicri:level="2"><country xml:lang="fr">États-Unis</country>
<wicri:regionArea>Yahoo Inc., Sunnyvale, California</wicri:regionArea>
<placeName><region type="state">Californie</region>
</placeName>
</affiliation>
</author>
<author><name sortKey="Huang, Xiangji" sort="Huang, Xiangji" uniqKey="Huang X" first="Xiangji" last="Huang">Xiangji Huang</name>
<affiliation wicri:level="1"><country xml:lang="fr">Canada</country>
<wicri:regionArea>School of Information Technology, York University, Toronto</wicri:regionArea>
<wicri:noRegion>Toronto</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="j">Journal of Documentation</title>
<idno type="ISSN">0022-0418</idno>
<imprint><publisher>Emerald Group Publishing Limited</publisher>
<date type="published" when="2007-05-01">2007-05-01</date>
<biblScope unit="volume">63</biblScope>
<biblScope unit="issue">3</biblScope>
<biblScope unit="page" from="378">378</biblScope>
<biblScope unit="page" to="397">397</biblScope>
</imprint>
<idno type="ISSN">0022-0418</idno>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0022-0418</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass><keywords scheme="Teeft" xml:lang="en"><term>Algorithm</term>
<term>Asian language</term>
<term>Asian language text</term>
<term>Bayes</term>
<term>Bayes model</term>
<term>Bayes result</term>
<term>Best compression</term>
<term>Best result</term>
<term>Binary problems</term>
<term>Byte level models</term>
<term>Categorization</term>
<term>Certain level</term>
<term>Character level</term>
<term>Character level features</term>
<term>Chinese data</term>
<term>Chinese data table</term>
<term>Chinese experiments</term>
<term>Chinese text</term>
<term>Chinese word segmentation</term>
<term>Class label</term>
<term>Computational linguistics</term>
<term>Entropy</term>
<term>Exponential form</term>
<term>Feature engineering</term>
<term>Feature selection</term>
<term>Feature space</term>
<term>Higher order models</term>
<term>Ieee transactions</term>
<term>Individual scores</term>
<term>Information retrieval</term>
<term>International conference</term>
<term>Japanese data</term>
<term>Japanese text</term>
<term>Jdoc</term>
<term>Language model</term>
<term>Language modeling</term>
<term>Language modeling approach</term>
<term>Language models</term>
<term>Large number</term>
<term>Markov independence assumption</term>
<term>Maximum entropy</term>
<term>Maximum entropy distribution</term>
<term>Modeling</term>
<term>Multinomial model</term>
<term>Mutual information</term>
<term>Natural language</term>
<term>Natural language processing</term>
<term>Negative examples</term>
<term>Overall accuracy</term>
<term>Peng</term>
<term>Perplexity</term>
<term>Retrieval</term>
<term>Segmentation</term>
<term>Segmentation accuracy</term>
<term>Segmentation performance</term>
<term>Sparse data problems</term>
<term>Standard approaches</term>
<term>Standard techniques</term>
<term>Standard text</term>
<term>Statistical language modeling</term>
<term>Support vector machine</term>
<term>Support vector machines</term>
<term>Support vectors</term>
<term>Table viii</term>
<term>Test corpus</term>
<term>Test document</term>
<term>Text categorization</term>
<term>Text categorization problems</term>
<term>Text retrieval</term>
<term>Training data</term>
<term>Training examples</term>
<term>Uncommon features</term>
<term>Word counts</term>
<term>Word level</term>
<term>Word level features</term>
<term>Word segmentation</term>
<term>Word segmentation accuracies</term>
<term>Word segmentation accuracy</term>
<term>Word segmentation information</term>
<term>Word sequences</term>
</keywords>
</textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract">Purpose: The purpose of this research is to compare several machine learning techniques on the task of Asian language text classification, such as Chinese and Japanese, where no word boundary information is available in written text. The paper advocates a simple language modeling based approach for this task. Design/methodology/approach: Naïve Bayes, maximum entropy model, support vector machines, and language modeling approaches were implemented and applied to Chinese and Japanese text classification. To investigate the influence of word segmentation, different word segmentation approaches were investigated and applied to Chinese text. A segmentation-based approach was compared with the non-segmentation-based approach. Findings: There were two findings: the experiments show that statistical language modeling can significantly outperform standard techniques, given the same set of features; and it was found that classification with word level features normally yields improved classification performance, but that classification performance is not monotonically related to segmentation accuracy. In particular, classification performance may initially improve with increased segmentation accuracy, but eventually stops improving, and can in fact even decrease, after a certain level of segmentation accuracy. Practical implications: Applying the findings to real web text classification is ongoing work. Originality/value: The paper is very relevant to Chinese and Japanese information processing, e.g. web page classification, web search.</div>
</front>
</TEI>
<affiliations><list><country><li>Canada</li>
<li>États-Unis</li>
</country>
<region><li>Californie</li>
</region>
</list>
<tree><country name="États-Unis"><region name="Californie"><name sortKey="Peng, Fuchun" sort="Peng, Fuchun" uniqKey="Peng F" first="Fuchun" last="Peng">Fuchun Peng</name>
</region>
</country>
<country name="Canada"><noRegion><name sortKey="Huang, Xiangji" sort="Huang, Xiangji" uniqKey="Huang X" first="Xiangji" last="Huang">Xiangji Huang</name>
</noRegion>
</country>
</tree>
</affiliations>
</record>
To manipulate this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000495 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000495 | SxmlIndent | more
To link to this page from the Wicri network
{{Explor lien |wiki= Wicri/Informatique |area= SgmlV1 |flux= Main |étape= Exploration |type= RBID |clé= ISTEX:6EF70F3DF40FE6C3EF63BBF334B5EDF3E88053F9 |texte= Machine learning for Asian language text classification }}
This area was generated with Dilib version V0.6.33.